How do deep-learning models generalize across populations? Cross-ethnicity generalization of COPD detection

Objectives To evaluate the performance and potential biases of deep-learning models in detecting chronic obstructive pulmonary disease (COPD) on chest CT scans across different ethnic groups, specifically non-Hispanic White (NHW) and African American (AA) populations.
Materials and methods Inspiratory chest CT and clinical data from 7549 Genetic epidemiology of COPD (COPDGene) individuals (mean age 62 years old, 56–69 interquartile range), including 5240 NHW and 2309 AA individuals, were retrospectively analyzed. Several factors influencing COPD binary classification performance on different ethnic populations were examined: (1) effects of training population: NHW-only, AA-only, balanced set (half NHW, half AA) and the entire set (NHW + AA all); (2) learning strategy: three supervised learning (SL) vs. three self-supervised learning (SSL) methods. Distribution shifts across ethnicity were further assessed for the top-performing methods.
Results The learning strategy significantly influenced model performance, with SSL methods achieving higher performance than SL methods (p < 0.001) across all training configurations. Training on balanced datasets containing NHW and AA individuals resulted in improved model performance compared to population-specific datasets. Distribution shifts were found between ethnicities for the same health status, particularly when models were trained with nearest-neighbor contrastive SSL. Training on a balanced dataset resulted in fewer distribution shifts across ethnicity and health status, highlighting its efficacy in reducing biases.
Conclusion Our findings demonstrate that utilizing SSL methods and training on large and balanced datasets can enhance COPD detection model performance and reduce biases across diverse ethnic populations. These findings emphasize the importance of equitable AI-driven healthcare solutions for COPD diagnosis.
Critical relevance statement Self-supervised learning coupled with balanced datasets significantly improves COPD detection model performance, addressing biases across diverse ethnic populations and emphasizing the crucial role of equitable AI-driven healthcare solutions.
Key Points
- Self-supervised learning methods outperform supervised learning methods, showing higher AUC values (p < 0.001).
- Balanced datasets with non-Hispanic White and African American individuals improve model performance.
- Training on diverse datasets enhances COPD detection accuracy.
- Ethnically diverse datasets reduce bias in COPD detection models.
- SimCLR models mitigate biases in COPD detection across ethnicities.


Introduction
Chronic obstructive pulmonary disease (COPD) poses a significant challenge in healthcare settings due to its nonreversible airway and/or alveolar abnormalities, leading to persistent airflow obstruction. Despite its global prevalence of 10.3% [1], COPD remains underdiagnosed and misdiagnosed [2], necessitating improved diagnostic strategies. The complexity of COPD diagnosis arises from its diverse clinical presentations influenced by biological, socioeconomic, and cultural factors, with racial and ethnic disparities further complicating management.
Recent reports from 2021 in the US reveal COPD prevalence at 6.2% in African American (AA) and non-Hispanic Black individuals, slightly lower than 6.5% in non-Hispanic Whites (NHW) and notably higher than 3.9% in Latino individuals [3]. Cross-sectional studies consistently show that AA individuals have lower lung function, with up to 10-15% lower forced expiratory volume in 1 s (FEV1) [4,5], attributed in part to anthropometric factors [4,6]. COPD disparities extend to health-related quality of life, dyspnea severity, exercise capacity, and exacerbation rates, with AA individuals experiencing worse outcomes than NHW individuals [7,8]. Imaging findings reflect these differences, with AA individuals showing less severe emphysema on CT scans despite matched lung function impairments [9]. While race adjustments in spirometry reference equations have historically addressed these differences, recent perspectives advocate for race-neutral approaches to reduce potential biases in diagnosis and treatment, particularly in vulnerable populations [10–15]. This evolving perspective necessitates a reconsideration of established COPD diagnostic practices that may perpetuate racial or ethnic bias.
Amidst these challenges, the emergence of artificial intelligence has offered promising avenues for COPD diagnosis and management. Particularly on the imaging diagnosis front, deep learning (DL) has played a crucial role in COPD early diagnosis and improved outcomes [16–23]. However, concerns about potential racial bias in AI detection models have also surfaced as their capabilities unfold.
Recent studies [24,25] suggest that rather than mitigating bias, these AI models might exacerbate and perpetuate unfairness, particularly against specific subpopulations. The mechanisms through which bias is perpetuated are multifaceted. During training, datasets may inadvertently underrepresent certain patient groups or contain harmful correlations, leading to a distortion of model outcomes. These concerns are amplified by the realization that human biases are encapsulated in the target labels used to train these models [26]. In addition, the algorithm design itself may be more or less prone to learning and propagating such biases. Among the main categories of algorithm design are supervised learning (SL) and self-supervised learning (SSL) models. SL methods can inherit biases present in the labeled datasets [27], potentially perpetuating disparities in disease detection [25,28,29]. SSL methods, on the other hand, are less susceptible to biases inherent in labeled data, as they rely on learning representations directly from unlabeled data, often through pretext tasks. This independence from biased labels is a significant advantage, potentially reducing the risk of perpetuating biases present in annotated datasets. However, SSL can still learn biases from the data itself, as well as from the design of the chosen SSL task. Even within the broader category of un-/self-supervised learning, state-of-the-art models may, to some extent, still harbor biases associated with learned associations from the data [26,30].
Despite the growing significance of the issue, previous research has largely overlooked the potential ethnic biases encoded in common COPD imaging detection models, whether they employ SL or SSL techniques. Furthermore, the impact of such biases on the performance of these models remains unexplored.
In the face of this complex, multicausal issue, we investigated how COPD predictive models on chest CT, whether supervised or self-supervised, generalize across different ethnic populations. This exploration is specifically defined within the context of the largest COPD imaging dataset, Genetic epidemiology of COPD (COPDGene), serving as the focal point for our comprehensive inquiry.
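Specifically, our exploration unfolds through three pivotal research questions:
- Research Question 1 (RQ1): To what extent do NHW and AA experience similar prediction performance when COPD detection models are trained on large-scale datasets?
- Research Question 2 (RQ2): What impact does the choice of the training population have on the differences in test accuracies between NHW and AA?
- Research Question 3 (RQ3): If differences exist, are these smaller for self-supervised methods?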

Study sample
Our study retrospectively analyzed data from the COPDGene phase 1 study [31] (clinicaltrials.gov, NCT00608764; http://www.copdgene.org/), which recruited self-reported NHW and AA current and former smokers (≥ 10 pack-years), aged 45-80 years, between 2008 and 2011. Paired chest CT in inspiration (Insp) and expiration (Exp), pulmonary function tests, and questionnaires were collected per subject. Imaging data were acquired on different scanners from different manufacturers. Image acquisition parameters vary by scanner model and are available in [31,32].
To streamline the analysis and maintain simplicity, only inspiratory images were included in this study, as contrastive tasks have demonstrated robustness even without the inclusion of expiratory images [20]. Pre-processing strategies followed the description of [20,21].

Subpopulation matching and data split
Differences in COPD prediction between NHW and AA, if any, could be related to confounding effects of demographic and risk-factor variables. To limit the influence of such factors, a population of NHW was selected to match the AA population (NHW-matched), based on individuals with the same age, gender, and smoking duration (years). To explore the effects of the training population, COPD prediction models were then trained on the entire dataset (NHW and AA), AA only, NHW-matched only, and on a perfectly balanced set (half NHW-matched + half AA).
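The matching step described above can be sketched as follows. This is a minimal illustration, assuming exact one-to-one matching on the three stated variables; the field names (`age`, `gender`, `smoking_years`), the toy records, and the one-to-one pairing rule are assumptions made for the example, not details taken from the study.

```python
from collections import defaultdict

def exact_match(aa, nhw, keys=("age", "gender", "smoking_years")):
    """Pair each AA record with one unused NHW record sharing the same key values."""
    pool = defaultdict(list)
    for person in nhw:
        pool[tuple(person[k] for k in keys)].append(person)
    pairs = []
    for person in aa:
        candidates = pool[tuple(person[k] for k in keys)]
        if candidates:
            # pop() removes the NHW record so it cannot be matched twice
            pairs.append((person, candidates.pop()))
    return pairs

# Hypothetical toy records, for illustration only
aa = [
    {"id": "AA1", "age": 60, "gender": "F", "smoking_years": 30},
    {"id": "AA2", "age": 55, "gender": "M", "smoking_years": 20},
]
nhw = [
    {"id": "NHW1", "age": 60, "gender": "F", "smoking_years": 30},
    {"id": "NHW2", "age": 70, "gender": "M", "smoking_years": 40},
]
matched = exact_match(aa, nhw)
print([(a["id"], n["id"]) for a, n in matched])
```

In this toy case only AA1 finds a counterpart; AA individuals without an exact NHW counterpart are simply left unmatched, which is one of several possible policies.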
Differences in COPD prediction were evaluated on the test set splits of AA only and NHW-matched only.
Data splits for training, validation, and testing followed the same strategy as in [20,21], now applying it to the AA set.

COPD model prediction
Aiming to investigate the impact of SL and SSL on COPD binary classification performance, several models were evaluated.

Supervised learning (SL) models
For the evaluation of SL methods, we adopted three well-established voxel-based approaches: end-to-end patch classifier with a recurrent neural network (PatClass + RNN); multiple instance learning (MIL) with RNN as aggregation (MIL + RNN); and attention-based MIL (MIL + Att). All methods are thoroughly described in the Supplementary Materials S-1.

Self-supervised learning (SSL) models
For the evaluation of SSL methods, three self-supervised contrastive tasks were compared (SimCLR, NNCLR, and context-aware NNCLR), having a fixed anomaly detection approach as a downstream task. These models are based on a recently proposed self-supervised anomaly detection method by Almeida SD et al [20,21] (cOOpD). This approach is founded on modeling the distribution of normal-lung regions utilizing contrastive latent representations and identifying deviations from this distribution as COPD-anomalous samples. In their approach, SimCLR [33] was used as the self-supervised contrastive model, as a pretext task to extract highly informative latent features per lung region. Subsequently, a generative model was applied to healthy regions from normal-lung-function subjects to discern the distribution of "normality." Out-of-distribution samples were assigned an anomaly score based on the negative log likelihood, enabling the identification of COPD regions. Patient-level labels were obtained by aggregating local-level scores.
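The scoring logic of this downstream task can be illustrated with a small sketch. It is not the cOOpD implementation: a single diagonal Gaussian stands in for the generative model, mean aggregation stands in for the patient-level aggregation, and the two-dimensional embeddings are made-up values; all three are assumptions chosen to keep the example short.

```python
import math
from statistics import mean, pvariance

def fit_diag_gaussian(vectors, eps=1e-6):
    """Estimate per-dimension mean and variance from 'normal' region embeddings."""
    dims = list(zip(*vectors))
    mu = [mean(d) for d in dims]
    var = [pvariance(d) + eps for d in dims]  # eps avoids division by zero
    return mu, var

def nll(x, mu, var):
    """Negative log likelihood of embedding x under the diagonal Gaussian."""
    return sum(
        0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mu, var)
    )

def patient_score(region_embeddings, mu, var):
    """Aggregate region-level anomaly scores into one patient-level score (mean)."""
    return mean(nll(x, mu, var) for x in region_embeddings)

# Hypothetical latent vectors of regions from normal-lung-function subjects
normal_regions = [[0.0, 1.0], [0.2, 0.8], [-0.1, 1.1], [0.1, 0.9]]
mu, var = fit_diag_gaussian(normal_regions)

healthy_patient = [[0.05, 0.95], [0.0, 1.0]]
copd_patient = [[1.5, -0.5], [0.1, 1.0]]  # one region far from "normality"
print(patient_score(healthy_patient, mu, var) < patient_score(copd_patient, mu, var))
```

Regions that fall far from the fitted "normal" distribution receive a high negative log likelihood, so a patient with even one strongly deviating region ends up with a higher aggregated score.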
To further enhance the richness of latent representations and extend beyond single instance positives, we adapted and compared the Almeida SD et al cOOpD method with two self-supervised pretext methods: the nearest-neighbor contrastive learning approach (NNCLR) [34] and a novel Context-Aware NNCLR (cNNCLR).
The NNCLR method introduces diversity in positive pairs by incorporating nearest neighbors sampled from a memory bank, aiming to increase the richness of latent representations and overcome limitations of pre-defined data augmentations.
The novel cNNCLR adaptation addresses concerns regarding disease-related sample selection by enforcing that nearest neighbors come from the same lung lobe and patient, leveraging spatial information for refined representations. This adaptation is particularly important given the subtle and heterogeneous pathological patterns observed in COPD.
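The difference between the two positive-selection rules can be sketched as below. This is an illustrative toy, assuming cosine similarity as the neighbor criterion and a memory bank stored as a list of dictionaries; the field names, lobe codes, and the choice to return `None` when no same-context entry exists are assumptions of the example rather than details of the published implementation.

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def nearest_neighbor(query, bank, context=None):
    """Return the most similar memory-bank entry.

    With context=(patient, lobe), candidates are restricted to the same
    patient and lung lobe (the cNNCLR-style constraint).
    """
    candidates = bank
    if context is not None:
        candidates = [e for e in bank if (e["patient"], e["lobe"]) == context]
    if not candidates:
        return None
    return max(candidates, key=lambda e: cosine(query, e["z"]))

# Hypothetical memory bank of latent vectors with patient and lobe metadata
bank = [
    {"z": [1.0, 0.0], "patient": "P2", "lobe": "RUL"},
    {"z": [0.8, 0.6], "patient": "P1", "lobe": "RUL"},
    {"z": [0.0, 1.0], "patient": "P1", "lobe": "LLL"},
]
query = [0.9, 0.1]  # embedding of a region from patient P1, lobe RUL

nnclr_pos = nearest_neighbor(query, bank)
cnnclr_pos = nearest_neighbor(query, bank, context=("P1", "RUL"))
print(nnclr_pos["patient"], cnnclr_pos["patient"])
```

Without the constraint, the closest entry belongs to a different patient; with it, the positive is drawn from the same patient and lobe even though its similarity is lower.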
For both NNCLR and cNNCLR, implementation configurations followed established strategies for random augmentations, encoder selection, and memory bank size, ensuring consistency with previous work [34]. The same downstream task as the original Almeida SD et al [20,21] method was employed for all self-supervised pretext tasks. Further details about the method and implementations are available in the Supplementary Materials S-2 and S-3. Supplementary Fig. 1 illustrates the main differences between NNCLR and cNNCLR.
The code for the self-supervised models is available on a public repository on GitHub (https://github.com/MIC-DKFZ/cOOpD).

Statistical analysis
Model performance was assessed using the Area Under the Receiver Operating Characteristic Curve (AUC) as the main evaluation metric. The Area Under the Precision-Recall Curve (AUPRC) is also reported. Further details are available in Supplementary Materials S-4. Differences in test performance between AA and NHW were measured based on the AUC.
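For readers less familiar with these two metrics, the sketch below computes both from patient-level scores. It is a plain-Python illustration on made-up labels and scores; the AUPRC is computed here as step-wise average precision, which is one common estimator and an assumption of this example, not necessarily the one used in the study.

```python
def roc_auc(labels, scores):
    """Probability that a random positive scores higher than a random negative.

    Ties between a positive and a negative count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / hits

# Hypothetical labels (1 = COPD) and anomaly scores
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
auprc = average_precision(labels, scores)
print(round(auc, 3), round(auprc, 3))
```

Computing the AUC separately on the AA and NHW test subsets and subtracting the two values gives the kind of between-population performance difference discussed in this section.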
Multiple linear regression analysis was performed to predict the AUC, based on the following independent variables: type of learning method (SL vs SSL), training configuration (AA, NHW, AA + NHW, AA + NHW balanced), and evaluation population (AA-only and NHW-only). Multiple linear regression was chosen to quantify the contribution of each predictor and their interactions, providing a comprehensive analysis of the effects of the learning method, training configuration, and evaluation population on the AUC. Corrections for multiple comparisons were addressed using the Holm-Bonferroni method.
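The Holm-Bonferroni step-down procedure mentioned above can be sketched in a few lines. The p values in the example are hypothetical and serve only to show the mechanics; the study itself ran its analyses in R.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject decision per hypothesis using Holm's step-down procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p value is compared against alpha / (m - k + 1)
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p values also fail
    return reject

# Hypothetical p values for four regression coefficients
p_values = [0.012, 0.20, 0.0004, 0.04]
decisions = holm_bonferroni(p_values)
print(decisions)
```

Note that the third hypothetical p value (0.04) would pass an uncorrected 0.05 threshold but is not rejected after correction, which is the behavior the procedure is designed to enforce.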
The distribution of the anomaly scores generated by the SSL methods was compared using the Kolmogorov-Smirnov Test. The hypothesis is that the distributions of the individual binary classes (diseased/healthy) should be identical, independently of the ethnicity. Benjamini-Yekutieli correction was applied to the p values.
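A compact sketch of the two ingredients named above follows: the two-sample Kolmogorov-Smirnov statistic and the Benjamini-Yekutieli adjustment. The score lists and p values are invented for illustration, and the p value of the KS test itself is left out; in practice it would come from a statistics package (for example `scipy.stats.ks_2samp` in Python or `ks.test` in R).

```python
def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def benjamini_yekutieli(p_values):
    """BY-adjusted p values, valid under arbitrary dependence between tests."""
    m = len(p_values)
    c_m = sum(1.0 / k for k in range(1, m + 1))  # harmonic correction factor
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    for pos, i in enumerate(order):
        rank = m - pos  # rank of this p value in ascending order
        running_min = min(running_min, p_values[i] * m * c_m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical patient-wise anomaly scores of healthy subjects per group
aa_healthy = [0.1, 0.2, 0.3, 0.4]
nhw_healthy = [0.5, 0.6, 0.7, 0.8]
d_shifted = ks_statistic(aa_healthy, nhw_healthy)   # fully separated samples
d_same = ks_statistic(aa_healthy, aa_healthy)       # identical samples

# Hypothetical raw p values from three such comparisons
adjusted = benjamini_yekutieli([0.01, 0.04, 0.03])
print(d_shifted, d_same, [round(p, 4) for p in adjusted])
```

A statistic near 0 means the two groups' score distributions overlap, while a value near 1 indicates a strong shift between them for the same health status.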
Statistical analyses were performed with R (version 4.2.3; R Foundation for Statistical Computing). A p value of < 0.05 was considered statistically significant.

Model performance
The differences in performance in terms of the AUC across models, training, and evaluation patient subgroups are summarized in Fig. 1 and in Supplementary Table 1. SSL methods generally outperform SL methods, with SL methods showing a lower average performance, irrespective of the training and evaluation configuration. Furthermore, AUC shows higher dispersion in SL models than in SSL. Overall, the best-performing combination is NNCLR with the context framework (cNNCLR) applied to the large-scale dataset (NHW + AA all), followed by SimCLR.
Table 2 presents results from the multiple linear regression model. Interactions between the various predictors were also tested, but since they were not significant, the model was refitted without interactions. As indicated in Table 2, the F-statistic p value is significant, implying that at least one of the predictors (the type of learning, training configuration, and evaluation population) is significantly associated with the AUC. The overall coefficient of determination (R²) indicates how much of the variance of the AUC the model explains. The contribution of each predictor (type of learning, training configuration, and evaluation population) to the dependent variable (AUC) is indicated by the respective β values and p values. SL methods had a significantly lower AUC (β = −18.90, p < 2e-16) compared to SSL, holding the training configuration and evaluation population constant. Training on the NHW-matched population resulted in a statistically significantly lower AUC than training on the NHW + AA all population (β = −4.09, p = 0.01). Although not significant, training on the AA-only population showed a trend toward a lower AUC than the reference NHW + AA all population. No differences were found for training on the balanced set (half NHW-matched + half AA) compared to training on the entire population (NHW + AA all), holding the type of learning and evaluation population constant. Similarly, no differences were found between the evaluation populations, holding the type of learning and the training configuration constant.
Table 2 footnote: The second line provides information on the model fit, including the F-statistic. Reference categories are underlined. * p value was significant after Holm-Bonferroni correction for multiple comparisons.

RQ1: To what extent do NHW and AA experience similar prediction performance when COPD detection models are trained on large-scale datasets?
No statistically significant difference was found between the evaluation populations when holding the other predictors constant. This indicates that NHW and AA individuals experience similar prediction performance, independently of the learning strategy and training configuration. Still, SL models trained with diverse data sources (NHW + AA all) exhibited larger mean performance differences between NHW and AA populations. Furthermore, this same training configuration (NHW + AA all) exhibited higher AUC than population-specific configurations (NHW-matched, p = 0.01; non-significant trend for AA-only), while no difference was found when compared with the balanced set (half NHW-matched + half AA). Therefore, although no difference was found for the COPD detection performance between AA and NHW, the performance is higher when models are trained on the entire set (NHW + AA all) or on a balanced set (half NHW-matched + half AA).
RQ2: What impact does the choice of the training population have on the differences in test accuracies between NHW and AA?
Regardless of the training population, SL consistently demonstrates higher AUC when evaluated on the AA population, compared to NHW individuals. For SSL, there are instances where the mean AUC is higher when training on a population matched with the evaluation population (e.g., NHW-matched when evaluating on NHW). This effect is consistent across all models and configurations, except for NNCLR models. Although no statistically significant difference was found for the evaluation population, the training configuration has an impact on the overall AUC: including both NHW and AA patients in the training set improves the model's performance on both populations compared to training on a population-specific dataset.
RQ3: If differences exist, are these smaller for self-supervised methods?
Figure 2 illustrates that SL generally exhibits lower performance and higher uncertainty in COPD prediction compared to SSL. Furthermore, SL trained on the entire population tends to show more pronounced differences in performance between NHW and AA individuals. Conversely, SSL, while achieving higher mean AUC overall (p < 2e-16), also reveals greater discrepancies between ethnicities, particularly when trained on other population configurations. As presented in Table 3, the statistical analysis confirmed these qualitative observations. No statistically significant differences were found for cOOpD models, indicating similar distributions for AA and NHW, both for healthy and diseased patients, across different training configurations. For NNCLR, on the other hand, significant differences were found between the marginal distributions for AA and NHW healthy patients across all training configurations (p < 0.0001 for all). For diseased patients, no evidence of differences was found, except when training on NHW-only (p = 0.03). Finally, for cNNCLR, differences in distributions were found between ethnicities of healthy individuals when models were trained on AA-only (p = 0.02), NHW-only (p < 0.0001), and on the entire dataset (p = 0.003). No differences were found in cNNCLR for the diseased patient-wise anomaly score distribution in all cases and for healthy individuals when the model was trained on the balanced set (half NHW-matched + half AA).

Discussion
In this study, we compared DL models for COPD detection on chest CT scans across ethnic groups. SSL outperformed SL methods (p < 0.001), yielding higher AUC and lower uncertainty. Training on the entire COPDGene dataset produced better performance, with no significant differences compared to a balanced population. SL performed better on AA individuals, while SSL showed varying NHW-AA performance differences. However, SL trained on the full dataset exhibited larger performance gaps between AA and NHW. Including NHW- and AA-matched patients improved performance and reduced differences, favoring SSL methods. In addition, SSL trained on balanced datasets showed more consistent anomaly score distributions across ethnicities, suggesting their potential to mitigate bias. These findings underscore the importance of considering ethnicity in model development and training to ensure equitable performance across diverse populations in COPD diagnosis.
While our study contributes significantly to understanding the performance and biases of DL models in COPD detection, it also sheds light on an important gap in the existing literature. The vast majority of fairness studies conducted to date have focused on pathology classification tasks within medical imaging [25, 35–40], with no attention paid to COPD diagnosis in minority classes. Despite the prevalence and significant healthcare burden associated with COPD, its diagnostic prediction performance across ethnicities remains understudied. Therefore, our work cannot be directly compared to other studies. However, studies from Glocker et al [25] and Seyyed-Kalantari et al [36] have evaluated bias in AI algorithms for various pathologies in chest X-rays. In line with our findings, both studies highlight the presence of performance disparities and biases in AI models utilized for disease detection across various demographic subgroups, including biological sex, race, and, for the latter, socioeconomic status. Still, the effect of the training population and different types of learning strategies on pathology diagnosis has not been addressed.
Our findings also resonate with recent guidance from the American Thoracic Society (ATS) [41], which advocates for the adoption of race-neutral average reference equations in pulmonary function testing interpretation, while discouraging race and ethnicity adjustments. Our observations are consistent with these overarching goals, as models trained on ethnicity-specific datasets exhibited, on average, larger differences in COPD prediction performance. On the other hand, on average, SSL exhibited fewer disparities in COPD prediction between different ethnic populations when models were trained on the entire or on the balanced dataset. Our analysis of anomaly score distributions also revealed fewer statistically significant differences between ethnicities across healthy and diseased subjects when models are trained on the balanced dataset. This underscores the importance of leveraging ethnically diverse training datasets to enhance model robustness and mitigate potential biases.
The implications of our study are multifaceted and can inform future research and clinical practice in several key areas. First, our findings underscore the importance of evaluating DL models for medical applications across diverse demographic groups to ensure equitable performance and minimize biases. This highlights the need for comprehensive data collection efforts that include diverse populations to train models effectively and promote generalizability. Second, our study emphasizes the potential of SSL methods to mitigate biases and improve model performance in COPD detection. One possible reason for this improvement over SL is that SSL methods likely circumvent biases that may be inherent in labeled datasets, thereby improving model generalization and reducing disparities across different demographic groups. SSL models excel in capturing nuanced patterns and variations in lung characteristics, including those influenced by demographic factors, leading to more robust and adaptable performance. Moreover, SSL mitigates the risk of overfitting to specific labeled examples, making it more resilient in real-world applications. In general, SSL can reduce the dependency on labor-intensive manual labeling and leverage the abundant unlabeled CT scans in medical datasets, offering scalable solutions for improving COPD diagnosis and equity in healthcare outcomes. This suggests that investing in the development and evaluation of SSL approaches could yield significant benefits for improving COPD diagnostic accuracy and reducing disparities. In addition, our analysis underscores the importance of considering the choice of training data and its impact on model performance and bias. Finally, our study raises a critical consideration regarding the optimal balance between model performance and equity in healthcare outcomes. The choice between a lower-performing model with reduced disparities between ethnic groups or a higher-performing model with some differences between them warrants further examination in the context of improving equitable access to healthcare for diverse populations.
There are some limitations to our study worth reporting. While we rigorously matched the subgroups for comparison, we were unable to match other factors, such as the study site. Specifically, there were disproportionately fewer NHW individuals at study sites primarily serving AA individuals. Furthermore, while we focused on ethnicity as a key demographic variable, other factors such as socioeconomic status, education level, and environmental exposures were not addressed in our analysis. In addition, despite matching on smoking duration, discrepancies in smoking status (i.e., proportions of never-smokers, former smokers, and current smokers) between NHW and AA populations remain, influenced by differences in smoking initiation, cessation rates, cultural norms, and potential sampling variability within our study cohort. Future studies should aim to incorporate a more comprehensive set of demographic and clinical variables to better understand the complex interplay between patient characteristics and model performance.
In conclusion, our study highlights the significance of considering ethnicity in developing equitable COPD diagnostic models. We advocate for comprehensive data collection efforts and the exploration of SSL methods to mitigate biases and improve diagnostic accuracy across diverse populations, paving the way to ensuring equitable benefits for all population segments.

Fig. 1 The schematic workflow of this study. A Main differences in COPD-related clinical characteristics between non-Hispanic Whites (NHW) and African Americans (AA) and visual representation of normal and diseased regions on chest CT. The impact on COPD detection performance was assessed by the influence of two factors: B Training population (AA-only, NHW-matched-only, AA and NHW-matched, and AA and NHW all) and C Learning strategy (supervised learning [SL] and self-supervised learning [SSL]). D The impact is evaluated by comparing the Area Under the Receiver Operating Characteristic Curve (AUC) per training configuration and learning strategy and by assessing the differences in distributions produced by the top-performing method

Fig. 2 Supervised models show lower performance and higher uncertainty compared to self-supervised models. Comparison of COPD prediction performance across supervised (MIL + Att, MIL + RNN, PatchClass + RNN) and self-supervised (cOOpD, NNCLR, cNNCLR) models and across training and evaluation sub-ethnicity groups. Training subgroups are represented by color, while evaluation subgroups are represented by linetype. Average classification performance across ethnic subgroups is shown in terms of the AUC (%), with error bars representing min-max values. The barplot in the top left corner represents mean AUC differences (NHW − AA) between models. Thus, positive bars represent higher prediction performance for models evaluated on NHW, compared to models evaluated on AA

Fig. 3 Training on a matched, balanced population (half NHW-matched + half AA) shows fewer distribution shifts across ethnicities, for the same health condition. The SSL cOOpD model generalizes best. Distribution shifts in patient-wise anomaly scores. Distributions of healthy (green) and COPD cases (orange) for AA individuals (full line) and NHW individuals (dotted line) are plotted across self-supervised models (cOOpD, NNCLR, cNNCLR), for four training configurations (AA-only, NHW-only, half NHW-matched + half AA, AA + NHW all). The plots were generated using all individuals in the test set group. Statistically significant differences, noted by "*" (**** < 0.0001, *** < 0.001, ** < 0.01, * < 0.05), measured by the Kolmogorov-Smirnov Test are displayed per condition: healthy (left) and COPD (right)

Examining the potential for unfairness in DL algorithms, whether due to the underrepresentation of minority populations in the training set or by the algorithm itself, is the first step for a comprehensive understanding of the intricate relationship between training population dynamics and algorithmic fairness in the realm of COPD predictive models.

Table 1
Demographic data and functional parameters for the analyzed COPDGene study sample, divided by ethnicity and by dataset split (training, evaluation, and testing)

Table 2
Multiple linear regression analysis to predict the main performance metric (AUC) with the following as independent variables: type of learning method (supervised vs self-supervised), training configuration (AA only, NHW-matched only, AA + NHW all, AA + NHW balanced) and evaluation population (NHW-matched only and AA only)

Table 3
Kolmogorov-Smirnov tests comparing patient-wise anomaly score distributions between ethnic evaluation populations for self-supervised models (cOOpD, NNCLR, cNNCLR)